Automatic detection of borrowing (Open problems in computational diversity linguistics 2)
This is the third of a series of 12 blog posts published in 2019, discussing open problems in computational diversity linguistics. It discusses the problem of automatic borrowing detection
Automatic sound law induction (Open problems in computational diversity linguistics 3)
This is the fourth of a series of 12 blog posts published in 2019, discussing open problems in computational diversity linguistics. It discusses the problem of automatic sound law induction
A Computational Model for the Assessment of Mutual Intelligibility Among Closely Related Languages
Closely related languages show linguistic similarities that allow speakers of
one language to understand speakers of another language without having actively
learned it. Mutual intelligibility varies in degree and is typically tested in
psycholinguistic experiments. To study mutual intelligibility computationally,
we propose a computer-assisted method using the Linear Discriminative Learner,
a computational model developed to approximate the cognitive processes by which
humans learn languages, which we expand with multilingual semantic vectors and
multilingual sound classes. We test the model on cognate data from German,
Dutch, and English, three closely related Germanic languages. We find that our
model's comprehension accuracy depends on 1) the automatic trimming of
inflections and 2) the language pair for which comprehension is tested. Our
multilingual modelling approach not only offers new methodological findings
for automatic testing of mutual intelligibility across languages but also
extends the use of Linear Discriminative Learning to multilingual settings.
Comment: To appear in: Proceedings of the 6th Workshop on Research in
Computational Linguistic Typology and Multilingual NLP (SIGTYP 2024
Bangime: secret language, language isolate, or language island? A computer‐assisted case study
We report the results of a qualitative and quantitative lexical comparison between Bangime and neighboring languages. Our results indicate that the status of the language as an isolate remains viable, and that Bangime speakers have had different levels of language contact with other Malian populations at various points throughout their history. Bangime speakers, the Bangande, claim Dogon ancestry. The Bangande portray this connection to Dogon through the fact that the language has both recent borrowings from neighboring Dogon varieties and more rooted vocabulary from Dogon languages spoken to the east from whence the Bangande claim to have come. Evidence of multilayered long‐term contact is clear: lexical items have permeated even core vocabulary. However, strikingly, the Bangande are seemingly unaware that their language is not intelligible with any Dogon variety. We hope that our findings will influence future studies on the reconstruction of the Dogon languages and other neighboring language varieties to shed light on the mysterious history of Bangime and its speakers
Trimming Phonetic Alignments Improves the Inference of Sound Correspondence Patterns from Multilingual Wordlists
Sound correspondence patterns form the basis of cognate detection and
phonological reconstruction in historical language comparison. Methods for the
automatic inference of correspondence patterns from phonetically aligned
cognate sets have been proposed, but their application to multilingual
wordlists requires extremely well annotated datasets. Since annotation is
tedious and time consuming, it would be desirable to find ways to improve
aligned cognate data automatically. Taking inspiration from trimming techniques
in evolutionary biology, which improve alignments by excluding problematic
sites, we propose a workflow that trims phonetic alignments in comparative
linguistics prior to the inference of correspondence patterns. Testing these
techniques on a large standardized collection of ten datasets with expert
annotations from different language families, we find that the best trimming
technique substantially improves the overall consistency of the alignments. The
results show a clear increase in the proportion of frequent correspondence
patterns and words exhibiting regular cognate relations.
Comment: The paper was accepted at the SIGTYP workshop 2023 co-located with
EAC
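The abstract above does not spell out the trimming techniques themselves, so the following is only a minimal sketch of one possible gap-based variant, in the spirit of alignment trimming in biology: columns in which too many rows show a gap are dropped. The function name `trim_alignment`, the threshold `max_gap_ratio`, and the toy alignment are all assumptions made for illustration and are not taken from the paper.

```python
def trim_alignment(alignment, gap="-", max_gap_ratio=0.5):
    """Drop alignment columns whose share of gap symbols exceeds max_gap_ratio.

    `alignment` is a list of equally long rows, each a list of sound segments.
    This is an illustrative sketch, not the workflow evaluated in the paper.
    """
    if not alignment:
        return []
    n_rows = len(alignment)
    # zip(*alignment) iterates over the columns (sites) of the alignment.
    keep = [
        index
        for index, column in enumerate(zip(*alignment))
        if column.count(gap) / n_rows <= max_gap_ratio
    ]
    return [[row[index] for index in keep] for row in alignment]


# Invented toy alignment: the last column is a gap in three of four rows.
alignment = [
    ["h", "a", "n", "d", "-"],
    ["h", "a", "n", "t", "-"],
    ["h", "æ", "n", "d", "-"],
    ["h", "a", "n", "d", "ə"],
]
trimmed = trim_alignment(alignment)
```

With the default threshold, the final column (three gaps out of four rows) is removed, so every trimmed row keeps only its first four segments. Correspondence patterns would then be inferred from the remaining, more consistent sites.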
Statistical proof of language relatedness (Open problems in computational diversity linguistics 7)
This is the eighth of a series of 12 blog posts published in 2019, discussing open problems in computational diversity linguistics. It discusses the problem of statistical proof of language relatedness
Formal and quantitative approaches to historical language comparison
Lecture, given at the Fifth Pavia International Summer School for Indo-European Linguistics (Università di Pavia, 2022-09-05/09
Save the trees
Skepticism regarding the tree model has a long tradition in historical linguistics. Although scholars have emphasized that the tree model and its long-standing counterpart, the wave theory, are not necessarily incompatible, the opinion that family trees are unrealistic and should be completely abandoned in the field of historical linguistics has always enjoyed a certain popularity. This skepticism has further increased with the advent of recently proposed techniques for data visualization which seem to confirm that we can study language history without trees. In this article, we show that the concrete arguments that have been brought up in favor of achronistic wave models do not hold. By comparing the phenomenon of incomplete lineage sorting in biology with processes in linguistics, we show that data which do not seem as though they can be explained using trees can indeed be explained without turning to diffusion as an explanation. At the same time, methodological limits in historical reconstruction might easily lead to an overestimation of regularity, which may in turn appear as conflicting patterns when the researcher is trying to reconstruct a coherent phylogeny. We illustrate how, in several instances, trees can benefit language comparison, although we also discuss their shortcomings in modeling mixed languages. While acknowledging that not all aspects of language history are tree-like, and that integrated models which capture both vertical and lateral language relations may depict language history more realistically than trees do, we conclude that all models claiming that vertical language relations can be completely ignored are essentially wrong: either they still tacitly draw upon family trees or they only provide a static display of data and thus fail to model temporal aspects of language history
Multiple sequence alignment in historical linguistics. A sound class based approach
In this paper, a new method for multiple sequence alignment in historical linguistics is presented. The algorithm is based on the traditional framework of progressive multiple sequence alignment (cf. Durbin et al. 2002:143-149), which is further enhanced by (1) a sound class representation of phonetic sequences (cf. Dolgopolsky 1986, Turchin et al. 2010) accompanied by specific scoring functions, (2) the modification of gap scores based on prosodic context, and (3) a new method for the detection of swapped sites in already aligned sequences. The algorithm is implemented as part of the LingPy library (http://lingulist.de/lingpy), a suite of open source Python modules for various tasks in quantitative historical linguistics. The method was tested on a benchmark dataset of 152 manually edited multiple alignments covering data for 192 Bulgarian dialects (Prokić et al. 2009). The results show that the new method yields alignments which differ from the gold standard in only 5 % of all sequences
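To illustrate the idea of aligning on sound classes rather than on raw segments, here is a minimal sketch of a pairwise Needleman-Wunsch alignment in which segments are first mapped to coarse classes. It is not the LingPy implementation and covers only the pairwise case, without prosodic gap scores or swap detection. The `SOUND_CLASSES` table is a simplified, hypothetical mapping rather than the actual Dolgopolsky model, and the scoring values are assumptions.

```python
# Hypothetical, simplified sound classes (NOT the actual Dolgopolsky model).
SOUND_CLASSES = {
    "p": "P", "b": "P", "f": "P",
    "t": "T", "d": "T",
    "k": "K", "g": "K",
    "s": "S", "z": "S",
    "m": "M", "n": "N", "r": "R", "l": "R",
    "a": "V", "e": "V", "i": "V", "o": "V", "u": "V",
}


def to_classes(segments):
    """Map each segment to its sound class; unknown segments map to themselves."""
    return [SOUND_CLASSES.get(segment, segment) for segment in segments]


def align(seq_a, seq_b, match=1, mismatch=-1, gap=-1):
    """Globally align two segment lists, scoring on sound classes."""
    a, b = to_classes(seq_a), to_classes(seq_b)
    n, m = len(a), len(b)

    def sub(i, j):
        # Substitution score for the i-th class of a and the j-th class of b.
        return match if a[i - 1] == b[j - 1] else mismatch

    # Fill the dynamic programming matrix.
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),
                score[i - 1][j] + gap,
                score[i][j - 1] + gap,
            )

    # Trace back from the bottom-right cell, emitting the original segments.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(seq_a[i - 1])
            out_b.append(seq_b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(seq_a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(seq_b[j - 1])
            j -= 1
    return out_a[::-1], out_b[::-1], score[n][m]


row_a, row_b, total = align(["t", "a", "k"], ["d", "a", "g"])
```

Because `t`/`d` and `k`/`g` fall into the same toy classes, the two sequences align without gaps and every position counts as a match, even though only the vowel is identical at the segment level. A progressive multiple alignment would repeat such pairwise steps along a guide tree.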